Use UnsafeAtomics to fix race in accumulate #44
Conversation
This alone is not sufficient to resolve the data-race.
@anicusan I currently don't have an MWE that triggers anymore, but I believe this is closer to the spirit of what you intended to implement?
This fixes #46 on AMDGPU MI300 as per #46 (comment)
Apologies for the radio silence - I've been away at a conference. This is extremely useful, thank you for digging into this @vchuravy and @giordano. Two questions from my side:
UnsafeAtomics is stable and will be supported by JuliaGPU (it's the underpinning of Atomix).
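For readers following along, the relevant primitive is UnsafeAtomics' explicitly-ordered loads and stores. A minimal CPU-side sketch (my own, not code from this PR) of the release/acquire pairing a scan handshake relies on:

```julia
import UnsafeAtomics as UA

# A release-store publishes a value so that all prior writes are ordered
# before it; a matching acquire-load on the reader side orders later reads
# after it. `GC.@preserve` keeps the array rooted while we use raw pointers.
flags = zeros(UInt32, 8)
GC.@preserve flags begin
    p = pointer(flags, 1)
    UA.store!(p, UInt32(1), UA.release)   # publish
    v = UA.load(p, UA.acquire)            # observe
    @assert v == UInt32(1)
end
```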
Yeah,
Might be worth a rebase since there have been changes in the meantime. Metal compilation fails because it only supports 32-bit integers (and floats) for these atomic operations. Finally, I ran this on a 3080 and got 1 failed test (test/accumulate.jl line 47, or 51 after rebase). I couldn't reproduce it until I increased the number of iterations for the "small block sizes -> many blocks" test to 10000, and then I get <10 failures.
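A sketch of that kind of stress loop (the real test lives in test/accumulate.jl; the AK keyword names here are assumptions, not checked against its API):

```julia
using CUDA
import AcceleratedKernels as AK

# Re-run the accumulate comparison many times; with a latent race, a handful
# of the 10_000 runs disagree with the CPU reference. Integer inputs make the
# comparison exact, so any mismatch is a real failure rather than rounding.
failures = sum(1:10_000) do _
    x_h = rand(Int32(1):Int32(100), 1_000_000)
    # `init` and `block_size` keywords are assumed; small blocks force many
    # workgroups, the configuration that triggered failures above.
    y = AK.accumulate(+, CuArray(x_h); init=Int32(0), block_size=128)
    Array(y) != accumulate(+, x_h)
end
@show failures
```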
I really would prefer not to undo the Metal change without understanding why. The memory semantics ought to be the same, and as we have seen, more capable microarchitectures surface the latent bugs in this algorithm. Are you seeing failures at large enough problem sizes on CUDA as well?
I am. It also reproduces on the un-rebased version of this PR. Either more iterations or bigger arrays increase the odds of triggering the bug.
The Metal backend will not be able to support the DecoupledLookback algorithm - that was the primary reason for developing ScanPrefixes (issue / PR).
The decoupled lookback assumes that changes made to a device array (global memory) become visible to other blocks while the kernel runs. We have large-scale accumulate tests though - at least enough to saturate the SMs - and I have not been able to reproduce accumulate bugs on CUDA / AMD; @christiangnrd do you have an MWE? @vchuravy, on oneAPI we seem to get the following atomics error: … Any ideas?
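To make that assumption concrete, here is a small CPU model of the decoupled-lookback handshake (illustrative only; the status values and names are mine, not AK's):

```julia
# Status values a block publishes into global memory (names/values are mine).
const INVALID, AGGREGATE, PREFIX = 0x00, 0x01, 0x02

# Block `b` walks back over its predecessors, summing published partial
# results, until it reaches one that already holds a full inclusive prefix.
# On a GPU, `status`/`value` live in global memory and each read must be an
# acquire load; if another block's writes never become visible, the loop
# spins forever - which is why cross-block visibility is a hard requirement.
function lookback(status::Vector{UInt8}, value::Vector{Int}, b::Int)
    exclusive = 0
    i = b - 1
    while i >= 1
        s = status[i]
        s == INVALID && continue   # predecessor not published yet: spin
        exclusive += value[i]
        s == PREFIX && break       # full prefix found: done
        i -= 1                     # only an aggregate: keep walking back
    end
    return exclusive
end

# Example: block 3 looks back over blocks 2 (aggregate) and 1 (prefix).
lookback(UInt8[PREFIX, AGGREGATE], [5, 3], 3)  # == 8
```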
We widened the semantics on Metal recently to match semantics across all backends: JuliaGPU/Metal.jl#609. So either this algorithm is sound everywhere, or it is sound nowhere. For oneAPI: @maleadt, do you know off the top of your head which atomics are legal?
Are you sure the Metal Shading Language supports that? See this post:
Further to the above, from the gpuweb/gpuweb#2297 discussion (they had to change their scan because of Metal alone):
Also, in the MSL v4 spec, Table 6.13 ("Memory flag enumeration values for barrier functions") shows (bold mine):
So your addition in JuliaGPU/Metal.jl#609 was correct anyway, but device memory writes are still only guaranteed to be visible from/within workgroups, not across them. I can't really reduce the problem to a smaller MWE than the tests we already have.
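For the Julia side of that table: KernelAbstractions' `@synchronize` corresponds to the threadgroup-scoped barrier, so it orders memory within one workgroup only. A minimal sketch (my own, assuming an input length that is a multiple of the 256-lane workgroup):

```julia
using KernelAbstractions
using CUDA  # any KA backend works; CUDA shown for concreteness

@kernel function reverse_tiles!(y, @Const(x))
    li = @index(Local, Linear)
    gi = @index(Global, Linear)
    tile = @localmem eltype(x) (256,)
    tile[li] = x[gi]
    # Workgroup barrier: after this, every lane of this workgroup sees the
    # tile writes above (mem_threadgroup scope in MSL terms). Nothing here
    # orders writes across workgroups.
    @synchronize
    y[gi] = tile[257 - li]  # read another lane's write: safe only post-barrier
end

x = CUDA.rand(Float32, 4096)
y = similar(x)
reverse_tiles!(CUDABackend(), 256)(y, x; ndrange=length(x))
KernelAbstractions.synchronize(CUDABackend())
```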
What is left to do for this PR:
Related, @christiangnrd what did your benchmarks show regarding the different scan algorithms?
MWE: |
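The script itself didn't survive the export; its shape was presumably something like the following (hypothetical: the `alg` keyword and the algorithm type names are inferred from this thread, not checked against AK's API):

```julia
using BenchmarkTools
using CUDA
import AcceleratedKernels as AK

# Time accumulate for both element types and both scan algorithms discussed
# above; CUDA.@sync ensures the kernel finishes inside the timed region.
for T in (Float32, Int64)
    x = CuArray(rand(T, 2^24))
    for alg in (AK.DecoupledLookback(), AK.ScanPrefixes())
        t = @belapsed CUDA.@sync AK.accumulate(+, $x; init=zero($T), alg=$alg)
        println(T, " / ", typeof(alg), ": ", round(t * 1e3; digits=3), " ms")
    end
end
```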
So results are closer with Int64 than with Float32, but DecoupledLookback ranges from slightly better or worse to much worse than the other 2. Run on a 3060.
It seems …
Trying to fix #41 (comment)